Running Flow

This is an example pipeline that runs two tumor/normal pairs through the hg19 pipeline for JaBbA. There are a few steps to setting up and running these jobs. After getting through the pipeline, we run a post-JaBbA job and work toward visualizing some of the plots.

The pipeline that we will run through is covered here:

Here we align FASTQ files into BAMs for tumor/normal paired WGS samples. Next, we run a series of jobs whose outputs are required inputs for the JaBbA pipeline. However, since aligning WGS samples can take upwards of a week, we will start off with a couple of already-aligned hg19 BAM files and work from there.

Our Pairs Table

This is our pairs table in its purest form: one row per pair, with the BAM files for the associated tumor and normal sample. We will run these samples through the pipeline:

pairs = readRDS("db/pairs.rds")
pairs
##        pair
## 40 25008780
## 49 25023660
##                                                                                                                         normal_bam
## 40 /gpfs/commons/groups/imielinski_lab/projects/DavisBRCA/Flow/ChaoJul2022/BWAMem_Fast/Sample_25008780-151/Sample_25008780-151.bam
## 49 /gpfs/commons/groups/imielinski_lab/projects/DavisBRCA/Flow/ChaoJul2022/BWAMem_Fast/Sample_25023660-549/Sample_25023660-549.bam
##                                                                                                                            tumor_bam
## 40 /gpfs/commons/groups/imielinski_lab/projects/DavisBRCA/Flow/ChaoJul2022/BWAMem_Fast/Sample_25008780-DNAt/Sample_25008780-DNAt.bam
## 49    /gpfs/commons/groups/imielinski_lab/projects/DavisBRCA/Flow/ChaoJul2022/BWAMem_Fast/Sample_25023660DNAt/Sample_25023660DNA.bam

There is intentionally something wrong with one of these samples: try and figure this out!
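One quick way to hunt for the problem is to verify that every BAM path in the table actually exists on disk. A minimal sketch in R (run on the cluster where these paths live):

```r
## sanity-check the pairs table: every tumor/normal BAM should exist on disk
pairs = readRDS("db/pairs.rds")
ok = data.frame(pair          = pairs$pair,
                tumor_exists  = file.exists(pairs$tumor_bam),
                normal_exists = file.exists(pairs$normal_bam))
ok[!ok$tumor_exists | !ok$normal_exists, ]  ## any row printed here is suspect
```

Checking for the matching `.bai` index files with the same trick is also worthwhile before launching jobs.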

BAM jobs

Once we have our generated BAMs, we are going to run a series of jobs on them to generate the prerequisites for running JaBbA. These jobs can all be run in tandem, so let’s try to get one of them going.

The first one is the HetPileups caller, which generates a pileup of allele fractions at heterozygous sites for a sample.

cat  ~/tasks/hg19/core/HetPileups.task
## # HetPileupsWGS urn:lsid:broadinstitute.org:cancer.genome.analysis:10898:6
## ~/modules/Pileup
## input    TumorBam    tumor_bam_wgs   path
## input    KeepOnlyHets    "TRUE"  value
## input    NormalBam   normal_bam_wgs  path
## input    NumCores    cores   value   "4"
## input    MaxDepth    max_depth   value   "1000"  ## default is 500 (from bamUtils::mafcount > bamUtils::varcount
## input    SitesVCF    hapmap.sites    path    "/gpfs/commons/groups/imielinski_lab/DB/references/hg19/hapmap_3.3.b37.vcf.gz"
## output    het_pileups_wgs     sites.txt$
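A sketch of launching this task through Flow is below. Note that the function names beyond `Flow::Task` are assumptions based on the Flow 101 and may differ in your version; treat this as illustrative and check the wiki:

```r
library(Flow)

## load the task definition shown above
task = Task("~/tasks/hg19/core/HetPileups.task")

## assumption: Job() pairs a task with the rows of an annotated pairs table,
## matching task input labels to column names; exact constructor arguments
## may differ -- see the Flow 101
jobs = Job(task, pairs)
srun(jobs)  ## assumed API for submitting the jobs to the scheduler
```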

Here we see a few arguments that will need to be supplied, and there are a couple of columns in our pairs table that may need to be renamed. The task text table format is something like this:

col1: input or output
col2: value name inside the Job class
col3: input label/colname of the file
col4: class expected of input
col5: any default argument
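For example, the HetPileups task above expects pairs-table columns named tumor_bam_wgs and normal_bam_wgs (col3), while our table has tumor_bam and normal_bam. A data.table rename sketch:

```r
library(data.table)

## rename our pairs-table columns to match the task's input labels
pairs = as.data.table(readRDS("db/pairs.rds"))
setnames(pairs,
         c("tumor_bam", "normal_bam"),
         c("tumor_bam_wgs", "normal_bam_wgs"))
saveRDS(pairs, "db/pairs.rds")
```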

For more information on how to set up jobs, the wiki on Flow is a good source to follow.

These jobs/tasks need to be run on the BAMs and can all be done in tandem. If all goes well, expect the full set of jobs to finish within a few hours.

Here’s a table of the BAM jobs to run:

tool | task location | description
Svaba | ~/tasks/hg19/core/Svaba.task | SV caller
Strelka | ~/tasks/hg19/core/Strelka2 | SNV caller
fragcounter (on tumor) | ~/tasks/hg19/core/fragCounter.task | fragment counting across the genome with GC and mappability corrections
fragcounter (on normal) | ~/tasks/hg19/core/fragCounter.normal.task | same as above but for normal
HetPileup | ~/tasks/hg19/core/HetPileups.task | get allele fractions for het sites across the genome

After all of these steps are run and processed, remember to merge their outputs together and save them into our pairs table.
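A minimal sketch of merging one job's output column back by pair id (the output-path construction here is hypothetical; use whatever paths Flow reports for your jobs):

```r
library(data.table)

pairs = as.data.table(readRDS("db/pairs.rds"))

## hypothetical: one row per pair with the path to its HetPileups sites.txt
het = data.table(pair            = pairs$pair,
                 het_pileups_wgs = file.path("Flow/HetPileups", pairs$pair,
                                             "sites.txt"))

## merge the new output column into the pairs table and save it back
pairs = merge(pairs, het, by = "pair")
saveRDS(pairs, "db/pairs.rds")
```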

Alternatively, one can also use GRIDSS/GRIPSS as an SV and junction breakpoint detector.

Working with the outputs

If all is working well, we can begin processing some of the BAM-derived outputs in preparation for JaBbA. One of these steps is dryclean, which further processes the fragcounter outputs, separating signal from noise via rPCA using a panel of normals (PON).

Dryclean

Dryclean our fragcounter outputs for both tumor and normal.

The tasks associated with dryclean:

dryclean normal: ~/tasks/hg19/core_dryclean/Dryclean.normal.task

dryclean tumor: ~/tasks/hg19/core_dryclean/Dryclean.task

Then we use CBS to get our segments ready for JaBbA.

After this step is done, run CovCBS: ~/tasks/hg19/core/CBS.task

MERGE ALL OF THIS INTO YOUR PAIRS TABLE BEFORE PROCEEDING.

Optional Job Ascat

Ascat purity/ploidy estimation and other estimations can be done separately here. This job is not necessary as JaBbA will do its own estimation, but it could help with a few fringe cases.

The task is: ~/tasks/ascat_seg.task

Running JaBbA

Now that we have all the right inputs, it’s time to run JaBbA.

The task is located here: ~/tasks/JaBbA_ZC_dev.task

and looking into the task:

Flow::Task("~/tasks/JaBbA_ZC_dev.task")
## No methods found in package 'data.table' for request: 'key' when loading 'Flow'
## #Module JaBbA_ZC_dev [Task: JaBbA_ZC_dev path: ~/tasks/JaBbA_ZC_dev.task] ("sh <libdir>/run.sh <RAfile> <CovFile> --verbose --name <TumorName> --seg <SegFile> --field <CovField...")
## ~/modules/JaBbA_ZC_dev/
## input    CovFile cov_rds path    
## input    RAfile  junctionFilePath    path    
## input    CovField    field   value   ratio
## input    RAsupp  junctionUnfiltered  path    /dev/null
## input    TierFieldName   tfield  value   tier
## input    NormalSegFile   cbs_nseg_rds    path    /dev/null
## input    SegFile cbs_seg_rds path    /dev/null
## input    SlackPenalty    slack   value   100
## input    OptionalHetPileupOutput het_pileups_wgs path    /dev/null
## input    Purity  purity  value   NA
## input    Ploidy  ploidy  value   NA
## input    tilim   tilim   value   6000
## input    epgap   epgap   value   1e-8
## input    ppmethod    pp.method   value   ppgrid
## input    maxna   maxna   value   0.8
## input    flags   flags   value   
## input    blacklist.coverage  blacklist.coverage  path    /dev/null
## input    blacklist.junctions blacklist.junctions path    /dev/null
## input    NumIterations   iter    value   0
## input    TumorName   pair    value   tumor
## input    job.spec.memory "15"    value   
## input    indel   indel   value   exclude
## input    cnsignif    cnsignif    value   0.00001
## input    lp  lp  value   TRUE
## input    ism ism value   TRUE
## input    mem treemem value   16
## input    fix.thres   fix.thres   value   -1
## input    gurobi  gurobi  value   FALSE
## input    nonintegral nonintegral value   FALSE
## output    jabba_rds   jabba.simple.rds$
## output    jabba_gg    jabba.simple.gg.rds$
## output    jabba_vcf   jabba.simple.vcf$
## output    jabba_raw_rds   jabba.raw.rds$
## output    opti    opt.report.rds$
## output    jabba_seg   jabba.seg$
## output    karyograph      karyograph.rds$

We have a bunch of values that will need to be filled in. Here are some of the inputs that will need to be altered:

  1. cov_rds should be tumor_dryclean_cov
  2. junctionFilePath should be svaba_somatic_vcf
  3. CovField should be foreground if you are following the blog file
  4. cbs_seg_rds should match cbs outputs from the merge.
  5. cbs_nseg_rds should match cbs outputs from the merge.
  6. het_pileups_wgs should be the het_pileup sites.txt output for that file.
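One way to satisfy those mappings without editing the task file is to alias the merged columns in the pairs table so their names match the task's input labels. This sketch assumes Flow resolves inputs by column name and that the source column names below match your merge step; verify both against your own table and the Flow 101:

```r
pairs = readRDS("db/pairs.rds")

## assumed column names from the merge step; adjust to your own table
pairs$cov_rds          = pairs$tumor_dryclean_cov
pairs$junctionFilePath = pairs$svaba_somatic_vcf

saveRDS(pairs, "db/pairs.rds")
```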

Purity and ploidy values can be precalculated via the ascat job and inputted here; otherwise the task and JaBbA will estimate these for you.

There are a few other parameters that can be tuned for the fits, and in your free time, I would recommend experimenting with these to see how the solver is affected by the various arguments.

From here, run the program, and we should be good to go. Expect each pair sample to take at most a few hours. There will occasionally be cases where the solver in JaBbA fails to converge, and it will become important to play around with parameters to solve for cases like these.

Running post-JaBbA jobs

Most post-JaBbA tasks relate either to the sample as a whole or to the gGraph structure generated by the JaBbA job. Here we will go into a simple post-JaBbA task: “event calling” via gGnome. We run the task ~/tasks/Events.task, which calls the event caller from gGnome. The caller then uses set thresholds and criteria to identify the classes of events found within each sample.
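Under the hood, the task boils down to something like the following gGnome calls (a sketch; check the gGnome tutorial for exact signatures and the event classes returned):

```r
library(gGnome)

## load the JaBbA output as a gGraph, then annotate SV event classes on it
gg = gG(jabba = "jabba.simple.rds")  ## path from the JaBbA job outputs
gg = events(gg)                      ## event calling: dups, dels, chromothripsis, etc.
```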

Most of these jobs like Fusions, Alterations, and Events are quick, so we can expect a reasonably fast turnaround.

Plotting

Finally, plotting is done via gGnome. Let’s try and plot some of our JaBbA outputs.

Use the gGnome tutorial for reference: Link

Using the Complex JaBbA output from the Events job:

  1. Plot a basic SV event via gGraph.

  2. Tweak the parameters around the range of the event to increase it by 10 kb on each side.

  3. Generate a gWalk of one of these SV events.

  4. Plot and save these plots via ppng or ppdf on mskiweb.
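A sketch of the first two steps is below. The file path and region are hypothetical, and ppng comes from skitools on the cluster; adjust to your own events output:

```r
library(gGnome)
library(GenomicRanges)

## hypothetical JaBbA gGraph output and event region
gg  = gG(jabba = "jabba.simple.gg.rds")
win = GRanges("8:128000000-129000000")

plot(gg$gt, win)        ## 1. basic plot of the event via the graph's gTrack
plot(gg$gt, win + 1e4)  ## 2. same window widened by 10 kb on each side
```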

Resources and More Reading:

Read into these in your free time!

101 on Flow Usage: http://mskiweb/101/flow.html

Links to Tools within the pipeline:

  1. Svaba
  2. Strelka2
  3. fragcounter
  4. Hetpileups is an internal tool, script is found here: ~/modules/Pileup/Pileup.R
  5. dryclean
  6. CBS
  7. ascat
  8. JaBbA
  9. gGnome